File Formats
From Files to Data
In order to read a file you need to know a few things:
- What distinguishes one record from another?
- What distinguishes one field from another?
- What ensures that a field or record is valid?
- Does the data set have row or column names? (a.k.a. headers & metadata)
- Is the metadata in a separate file or embedded in the file?
Structure of a Tabular Data File
Row and column names make it a lot easier to find and refer to data (e.g. the ‘East of England row’ or the ‘Total column’) but they are not data and don’t belong in the data set itself.
Usually, one record (a.k.a. observation) finishes and the next one starts with a ‘newline’ (\n) or ’carriage return (\r) or both (\r\n) but it could be anything (e.g. EOR).
Usually, one field (a.k.a. attribute or value) finishes and the next one starts with a comma (,) which gives rise to CSV (Comma-Separate Values), but it could be tabs (\t) or anything else too (; or | or EOF).
How would we choose a good field separator?
Pro tip: if we store column and row names separately from the data then we can access everything easily without having to factor in any ‘special’ values!
Noice also the nd here. This is the escape sequence again that you also encountered when dealing with the Shell as well. Remember that \ is necessary if you have a space in your file name or path.
Don’t Reinvent the Wheel
Reading data is a very common challenge so… there is a probably a package or class for that! You don’t need to tell Python how to read Excel files, SQL files, web pages… find a package that does it for you!
What you do want, if possible, is a tool that makes it easy to take the ‘native’ data types of another format and sensibly convert those to the closest equivalent in Python with minimal effort.
Most Common Formats
| Extension | Field Separator | Record Separator | Python Package |
|---|---|---|---|
.csv |
, but separator can appear in fields enclosed by ". |
\n but could be \r or \r\n. |
csv |
.tsv or .tab |
\t and unlikely to appear in fields. |
\n but could be \r or \r\n. |
csv (!) |
.xls or .xlsx |
Binary, you need a library to read. | Binary, you need a library to read. | xlrd/xlsxwriter |
.sav or .sas |
Binary, you need a library to read. | Binary, you need a library to read. | pyreadstat |
.json, .geojson |
Complex (,, [], {}), but plain text. |
Complex (,, [], {}), but plain text |
json, geojson |
.feather |
Binary, you need a library to read. | Binary, you need a library to read. | pyarrow, geofeather |
.parquet |
Binary, you need a library to read. | Binary, you need a library to read. | pyarrow |
One of the reasons we like CSV and TSV files is that they can be opened and interacted with using the Command Line (as well as Excel/Numbers/etc.) directly. As soon as you get into binary file formats you either need the original tool (and then export) or you need a tool that can read those formats. So the complexity level rises very quickly.
Of course, sometimes you can gain (e.g. SPSS or SAS) in terms of obtaining information about variable types, levels, etc. but usually you use these when that’s all that’s available or when you want to write a file for others to use.
The two formats at the bottom of the table are there because they are useful: the feather format was designed for fast reads and for data interachange with R, while Parquet is a highly-compressed, column-oriented storage format for large data. So for modest-sized data sets (a few hundred MB), or situations where you are working across R and Python, then Feather cannot be beat. For ‘big data’ where you need access to parts of the data set and want to do lazy loading, then parquet is the winner.
‘Mapping’ Data Types
You will often see the term ‘mapping’ used in connection to data that is not spatial, what do they mean? A map is the term used in some programming languages for a dict! So it’s about key : value pairs again.
Here’s a mapping
| Input (e.g. Excel) | Output (e.g. Python) |
|---|---|
| NULL, N/A, “” | None or np.nan |
| 0..n | int |
| 0.00…n | float |
| True/False, Y/N, 1/0 | bool |
| R, G, B (etc.) | int or str (technically a set, but hard to use with data sets) |
| ‘Jon Reades’, ‘Huanfa Chen’, etc. | str |
| ‘3-FEB-2020’, ‘10/25/20’, etc. | datetime module (date, datetime or time) |
These would be a mapping of variables between two formats. We talk of mapping any time we are taking inputs from one data set/format/data structure as a lookup for use with another data set/format/data structure.
Have a think about how you can use an int to represent nominal data. There are two ways: one of which will be familiar to students who have taken a stats class (with regression) and one of which is more intuitive to ‘normal’ users…
Testing a Mapping
Working out an appropriate mapping (representation of the data) is hugely time-consuming.
It’s commonly held that 80% of data science is data cleaning.
The Unix utilities (grep, awk, tail, head) can be very useful for quickly exploring the data in order to develop a basic understanding of the data and to catch obvious errors.
You should never assume that the data matches the spec.
Why This Isn’t Easy
Here’s raw Excel data.
What would we say the row and column names currently are?
Label These
Metadata is relevant to our understanding of the data and so is important, but it’s not relevant to treating the data as data so we need to be able to skip it.
Column names are going to be how we access a given attribute for each observation.
Row names are not normally data themselves, but are basically labels or identifiers for observations. Another term for this would be the data index.
If we store row and column names/indices separately from the data then we don’t have to treat them as ‘special’ or factor them into, for example, the calculation of summary stats.
Also have to consider trade-offs around mapping the full column names on to something a little faster and easier to type!
Things That Can Go Wrong…
A selection of real issues I’ve seen in my life:
- Truncation: server ran out of diskspace or memory, or a file transfer was interrupted.
- Translation: headers don’t line up with data.
- Swapping: column order differs from spec.
- Incompleteness: range of real values differs from spec.
- Corruption: field delimitters included in field values.
- Errors: data entry errors resulted in incorrect values or the spec is downright wrong.
- Irrelevance: fields that simply aren’t relevant to your analysis.
These will generally require you to engage with columns and rows (via sampling) on an individual level.
> Why We Need an Arrow, a Duck, and a Parquet
These three interconnected technologies (and there are others too!) have changed my life. Sad as that sounds, it has reduced the time need to load and query large data sets from minutes to milliseconds. And it has solved many of the errors that I just listed above!
Arrow and Parquet
- Arrow is an in-memory columnar format for data. Data is stored in a structured way in RAM making it blazingly fast for operations.
- Parquet is a highly-compressed columnar file format for data. Data is stored in a structured way on your hard drive.
- Feather is a raw storage format for Arrow.
TL;DR: for most applications Parquet will give nice, small files on disk and the benefits of columnar file storage; for computationally intensive applications where disk space and interoperability with other systems isn’t an issue then Feather might work.
What About the Duck?
- Serverless SQL queries against Parquet files
- Queries returned as Pandas data frames
- Select and filter before loading
- Fast conversion between CSV and Parquet via Arrow
For (Later) Reference
# Notice the engine and dtype_backend options
df = pandas.read_csv(fname, engine='pyarrow',
dtype_backend='pyarrow')
# And for parquet files
df = pandas.read_parquet(fname, columns=[...])
# And for DuckDB we can actually joing two
# files before they even get to Python!
q = f'''
SELECT * FROM
read_parquet('epc-ppd-2022-*.parquet') as ppd,
read_parquet('epc-ldd-2022-*.parquet') as ldd,
WHERE ppd.uid=ldd.uid
'''
df = duckdb.query(q).df()P.S. There’s also a command-line tool for DuckDB so you don’t even need Python.